Discrimination Analysis of Lip Motion Features for Multimodal Speaker Identification and Speech-reading
Authors
Abstract
In this thesis, a new multimodal speaker/speech recognition system that integrates audio, lip texture, lip geometry, and lip motion modalities is presented. There have been several studies that jointly use audio, lip intensity and/or lip geometry information for speaker identification and speech-reading applications. This work proposes using explicit lip motion information, instead of or in addition to audio, lip intensity and/or geometry information, for speaker identification and speech-reading within a unified feature selection and discrimination analysis framework, and addresses two important issues: i) Is using explicit lip motion information useful? and ii) if so, what are the best lip motion features for these two applications? The best lip motion features for speaker identification are considered to be those that result in the highest discrimination of individual speakers in a population, whereas for speech-reading, the best features are those providing the highest phoneme/word/phrase recognition rate. The audio modality is represented by the well-known mel-frequency cepstral coefficients (MFCC) along with the first and second derivatives, whereas the lip texture modality is represented by the 2D-DCT coefficients of the luminance component within a bounding box about the lip region. Several lip motion feature candidates are considered including dense motion features within a bounding box around the lip, lip contour motion features, lip shape features, and combinations of them. Furthermore, a novel two-stage discriminant analysis is introduced to select the best lip motion features for speaker identification and speech-reading applications. The fusion of audio, lip texture and lip motion modalities is performed by the so-called Reliability Weighted Summation (RWS) decision rule. Experimental results show that the proposed discriminant analysis significantly improves the unimodal performance of the lip motion modality. Moreover, using explicit lip motion information in addition to audio and lip texture yields further performance gains in bimodal speaker/speech recognition systems.
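The score-level fusion described above can be illustrated with a short sketch. The snippet below is a minimal illustration, not the thesis implementation: it assumes hypothetical per-modality classifier scores (e.g., HMM log-likelihoods, one per enrolled speaker) have already been computed, and combines them with a Reliability Weighted Summation, i.e., a weighted sum of normalized scores in which the weights reflect each modality's estimated reliability and sum to one. The feature functions use standard MFCC-with-deltas and 2D-DCT routines from librosa and SciPy as stand-ins for the front ends named in the abstract; the score normalization, DCT coefficient selection, and reliability weights are illustrative assumptions and may differ from the thesis.

```python
# Minimal sketch (not the thesis implementation) of the feature front ends
# and the Reliability Weighted Summation (RWS) fusion named in the abstract.
# Per-modality scores are assumed to come from hypothetical upstream
# classifiers (e.g., HMMs), one score per enrolled speaker.

import numpy as np
import librosa
from scipy.fft import dctn


def audio_features(signal, sr, n_mfcc=13):
    """MFCCs plus first and second derivatives (delta, delta-delta)."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2])            # shape: (3*n_mfcc, n_frames)


def lip_texture_features(lip_roi_gray, keep=8):
    """2D-DCT of the luminance inside the lip bounding box; keep the
    low-frequency keep-by-keep block as the texture feature vector
    (an illustrative choice, not the thesis' exact coefficient selection)."""
    coeffs = dctn(lip_roi_gray.astype(float), norm="ortho")
    return coeffs[:keep, :keep].ravel()


def rws_fusion(scores_per_modality, reliabilities):
    """Reliability Weighted Summation: normalize each modality's class
    scores, then combine them with reliability weights that sum to one.
    Min-max normalization is used here only for illustration."""
    w = np.asarray(reliabilities, dtype=float)
    w = w / w.sum()
    fused = np.zeros(len(scores_per_modality[0]), dtype=float)
    for wk, s in zip(w, scores_per_modality):
        s = np.asarray(s, dtype=float)
        s_norm = (s - s.min()) / (s.max() - s.min() + 1e-12)
        fused += wk * s_norm
    return fused                                 # one fused score per speaker


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_speakers = 5
    # Hypothetical scores from audio, lip texture, and lip motion classifiers.
    scores = [rng.normal(size=n_speakers) for _ in range(3)]
    fused = rws_fusion(scores, reliabilities=[0.5, 0.3, 0.2])
    print("identified speaker:", int(np.argmax(fused)))
```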
Similar Articles
Multimodal speaker/speech recognition using lip motion, lip texture and audio
We present a new multimodal speaker/speech recognition system that integrates audio, lip texture and lip motion modalities. Fusion of audio and face texture modalities has been investigated in the literature before. The emphasis of this work is to investigate the benefits of inclusion of lip motion modality for two distinct cases: speaker and speech recognition. The audio modality is represente...
Audio-Visual Correlation Modeling for Speaker Identification and Synthesis
This thesis addresses two major problems of multimodal signal processing using audiovisual correlation modeling: speaker recognition and speaker synthesis. We address the first problem, i.e., the audiovisual speaker recognition problem within an open-set identification framework, where audio (speech) and lip texture (intensity) modalities are fused employing a combination of early and late inte...
Lip-reading and Speech Perception of Hearing-impaired Students in Special Schools for the Hearing-impaired in Tehran
Objective: The goal of this study was to evaluate the lip-reading ability and speech perception of hearing-impaired students of special schools for the hearing impaired at different speech levels. Materials & Methods: In this cross-sectional study, 44 deaf students (9-12 years old) were selected by multi-stage cluster sampling from two special schools for the deaf in Tehran. Tools...
Audio-visual Integration in Multimodal Communication
In this paper, we review recent research that examines audio-visual integration in multimodal communication. The topics include bimodality in human speech, human and automated lip-reading, facial animation, lip synchronization, joint audio-video coding, and bimodal speaker verification. We also study the enabling technologies for these research topics, including automatic facial feature trackin...
Speaker and Speech Recognition by Audio-Visual Lip Biometrics
This paper proposes a new robust bi-modal audio-visual speech and speaker recognition system by lip-motion and speech biometrics. To increase the robustness of speech and speaker recognition, we have proposed a method using speaker lip motion information extracted from video sequences with low resolution (128 × 128 pixels). In this paper we investigate a biometric system for speech recognition a...